130 research outputs found

    Modeling the Evolution of Regulatory Elements by Simultaneous Detection and Alignment with Phylogenetic Pair HMMs

    The computational detection of regulatory elements in DNA is a difficult but important problem impacting our progress in understanding the complex nature of eukaryotic gene regulation. Attempts to utilize cross-species conservation for this task have been hampered both by evolutionary changes of functional sites and poor performance of general-purpose alignment programs when applied to non-coding sequence. We describe a new and flexible framework for modeling binding site evolution in multiple related genomes, based on phylogenetic pair hidden Markov models which explicitly model the gain and loss of binding sites along a phylogeny. We demonstrate the value of this framework for both the alignment of regulatory regions and the inference of precise binding-site locations within those regions. As the underlying formalism is a stochastic, generative model, it can also be used to simulate the evolution of regulatory elements. Our implementation is scalable in terms of numbers of species and sequence lengths and can produce alignments and binding-site predictions with accuracy rivaling or exceeding current systems that specialize in only alignment or only binding-site prediction. We demonstrate the validity and power of various model components on extensive simulations of realistic sequence data and apply a specific model to study Drosophila enhancers in as many as ten related genomes and in the presence of gain and loss of binding sites. Different models and modeling assumptions can be easily specified, thus providing an invaluable tool for the exploration of biological hypotheses that can drive improvements in our understanding of the mechanisms and evolution of gene regulation.

    Allelic Gene Structure Variations in Anopheles gambiae Mosquitoes

    Allelic gene structure variations and alternative splicing are responsible for transcript structure variations. More than 75% of human genes have structural isoforms of transcripts, but to date few studies have been conducted to verify the alternative splicing systematically. The present study used expressed sequence tags (ESTs) and EST-tagged SNP patterns to examine the transcript structure variations resulting from allelic gene structure variations in the major human malaria vector, Anopheles gambiae. About 80% of 236,004 available A. gambiae ESTs were successfully aligned to A. gambiae reference genomes. More than 2,340 transcript structure variation events were detected. Because the current A. gambiae annotation is incomplete, we re-annotated the A. gambiae genome with an A. gambiae-specific gene model so that the effect of variations on gene coding could be better evaluated. A total of 15,962 genes were predicted. Among them, 3,873 were novel genes and 12,089 were previously identified genes. The gene completion rate improved from 60% to 84%. Based on EST support, 82.5% of gene structures were predicted correctly. In light of the new annotation, we found that approximately 78% of transcript structure variations were located within the coding sequence (CDS) regions, and >65% of variations in the CDS regions have the same open reading frame. The association between transcript structure isoforms and SNPs indicated that more than 28% of transcript structure variation events were contributed by different gene alleles in A. gambiae. We successfully expanded the A. gambiae genome annotation. We predicted and analyzed transcript structure variations in A. gambiae and found that allelic gene structure variation plays a major role in transcript diversity in this important human malaria vector.

    Improved annotation with de novo transcriptome assembly in four social amoeba species

    Background: Annotation of gene models and transcripts is a fundamental step in genome sequencing projects. Often this is performed with automated prediction pipelines, which can miss complex and atypical genes or transcripts. RNA sequencing (RNA-seq) data can aid the annotation with empirical data. Here we present de novo transcriptome assemblies generated from RNA-seq data in four Dictyostelid species: D. discoideum, P. pallidum, D. fasciculatum and D. lacteum. The assemblies were incorporated with existing gene models to determine corrections and improvements on a whole-genome scale. This is the first time this has been performed in these eukaryotic species. Results: An initial de novo transcriptome assembly was generated by Trinity for each species and then refined with Program to Assemble Spliced Alignments (PASA). The completeness and quality were assessed with the Benchmarking Universal Single-Copy Orthologs (BUSCO) and Transrate tools at each stage of the assemblies. The final datasets of 11,315-12,849 transcripts contained 5,610-7,712 updates and corrections to >50% of existing gene models, including changes to hundreds or thousands of protein products. Putative novel genes were also identified, and alternative splice isoforms were observed for the first time in P. pallidum, D. lacteum and D. fasciculatum. Conclusions: In taking a whole-transcriptome approach to genome annotation with empirical data we have been able to enrich the annotations of four existing genome sequencing projects. In doing so we have identified updates to the majority of the gene annotations across all four species under study and found putative novel genes and transcripts which could be worthy of follow-up. The new transcriptome data we present here will be a valuable resource for genome curators in the Dictyostelia and we propose this effective methodology for use in other genome annotation projects.

    Computational Analysis and Experimental Validation of Gene Predictions in Toxoplasma gondii

    Toxoplasma gondii is an obligate intracellular protozoan that infects 20 to 90% of the population. It can cause both acute and chronic infections, many of which are asymptomatic, and, in immunocompromised hosts, can cause fatal infection due to reactivation from an asymptomatic chronic infection. An essential step towards understanding molecular mechanisms controlling transitions between the various life stages and identifying candidate drug targets is to accurately characterize the T. gondii proteome. We have explored the proteome of T. gondii tachyzoites with high throughput proteomics experiments and by comparison to publicly available cDNA sequence data. Mass spectrometry analysis validated 2,477 gene coding regions with 6,438 possible alternative gene predictions; approximately one third of the T. gondii proteome. The proteomics survey identified 609 proteins that are unique to Toxoplasma as compared to any known species including other Apicomplexan. Computational analysis identified 787 cases of possible gene duplication events and located at least 6,089 gene coding regions. Commonly used gene prediction algorithms produce very disparate sets of protein sequences, with pairwise overlaps ranging from 1.4% to 12%. Through this experimental and computational exercise we benchmarked gene prediction methods and observed false negative rates of 31 to 43%. This study not only provides the largest proteomics exploration of the T. gondii proteome, but illustrates how high throughput proteomics experiments can elucidate correct gene structures in genomes.

    RNA-Seq improves annotation of protein-coding genes in the cucumber genome

    Background: As more and more genomes are sequenced, genome annotation becomes increasingly important in bridging the gap between sequence and biology. Gene prediction, which is at the center of genome annotation, usually integrates various resources to compute consensus gene structures. However, many newly sequenced genomes have limited resources for gene predictions. In an effort to create high-quality gene models of the cucumber genome (Cucumis sativus var. sativus), based on the EVidenceModeler gene prediction pipeline, we incorporated the massively parallel complementary DNA sequencing (RNA-Seq) reads of 10 cucumber tissues into EVidenceModeler. We applied the new pipeline to the reassembled cucumber genome and included a comparison between our predicted protein-coding gene sets and a published set. Results: The reassembled cucumber genome, annotated with RNA-Seq reads from 10 tissues, has 23,248 identified protein-coding genes. Compared with the published prediction in 2009, approximately 8,700 genes reveal structural modifications and 5,285 genes only appear in the reassembled cucumber genome. All the related results, including genome sequence and annotations, are available at http://cmb.bnu.edu.cn/Cucumis_sativus_v20/. Conclusions: We conclude that RNA-Seq greatly improves the accuracy of prediction of protein-coding genes in the reassembled cucumber genome. The comparison between the two gene sets also suggests that it is feasible to use RNA-Seq reads to annotate newly sequenced or less-studied genomes.

    Towards an Evolutionary Model of Transcription Networks

    DNA evolution models have made invaluable contributions to comparative genomics, although it seemed formidable to include non-genomic features in these models. In order to build an evolutionary model of transcription networks (TNs), we had to forfeit the substitution model used in DNA evolution and start from modeling the evolution of the regulatory relationships. We present a quantitative evolutionary model of TNs, subjecting the phylogenetic distance and the evolutionary changes of cis-regulatory sequence, gene expression and network structure to one probabilistic framework. Using the genome sequences and gene expression data from multiple species, this model can predict regulatory relationships between a transcription factor (TF) and its target genes in all species, and thus identify TN re-wiring events. Applying this model to analyze the pre-implantation development of three mammalian species, we identified the conserved and re-wired components of the TNs downstream of a set of TFs including Oct4, Gata3/4/6, cMyc and nMyc. Evolutionary events on the DNA sequence that led to turnover of TF binding sites were identified, including the birth of an Oct4 binding site by a 2-nt deletion. In contrast to recent reports of large interspecies differences of TF binding sites and gene expression patterns, the interspecies difference in TF-target relationship is much smaller. The data showed increasing conservation levels from genomic sequences to TF-DNA interaction, gene expression, TN, and finally to morphology, suggesting that evolutionary changes are larger at molecular levels and smaller at functional levels. The data also showed that evolutionarily older TFs are more likely to have conserved target genes, whereas younger TFs tend to have larger re-wiring rates.

    Genome of the facultative scuticociliatosis pathogen Pseudocohnilembus persalinus provides insight into its virulence through horizontal gene transfer

    This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ The attached file is the published version of the article

    The Viral and Cellular MicroRNA Targetome in Lymphoblastoid Cell Lines

    Epstein-Barr virus (EBV) is a ubiquitous human herpesvirus linked to a number of B cell cancers and lymphoproliferative disorders. During latent infection, EBV expresses 25 viral pre-microRNAs (miRNAs) and induces the expression of specific host miRNAs, such as miR-155 and miR-21, which potentially play a role in viral oncogenesis. To date, only a limited number of EBV miRNA targets have been identified; thus, the role of EBV miRNAs in viral pathogenesis and/or lymphomagenesis is not well defined. Here, we used photoactivatable ribonucleoside-enhanced crosslinking and immunoprecipitation (PAR-CLIP) combined with deep sequencing and computational analysis to comprehensively examine the viral and cellular miRNA targetome in EBV strain B95-8-infected lymphoblastoid cell lines (LCLs). We identified 7,827 miRNA-interaction sites in 3,492 cellular 3′UTRs. Of these sites, 531 contained seed matches to viral miRNAs, and 24 PAR-CLIP-identified miRNA:3′UTR interactions were confirmed by reporter assays. Our results reveal that EBV miRNAs predominantly target cellular transcripts during latent infection, thereby manipulating the host environment. Furthermore, targets of EBV miRNAs are involved in multiple cellular processes that are directly relevant to viral infection, including innate immunity, cell survival, and cell proliferation. Finally, we present evidence that myc-regulated host miRNAs from the miR-17/92 cluster can regulate latent viral gene expression. This comprehensive survey of the miRNA targetome in EBV-infected B cells represents a key step towards defining the functions of EBV-encoded miRNAs, and potentially, identifying novel therapeutic targets for EBV-associated malignancies.

    Single nucleus genome sequencing reveals high similarity among nuclei of an endomycorrhizal fungus

    Nuclei of arbuscular endomycorrhizal fungi have been described as highly diverse due to their asexual nature and absence of a single cell stage with only one nucleus. This has raised fundamental questions concerning speciation, selection and transmission of the genetic make-up to next generations. Although this concept has become textbook knowledge, it is only based on studying a few loci, including 45S rDNA. To provide a more comprehensive insight into the genetic makeup of arbuscular endomycorrhizal fungi, we applied de novo genome sequencing of individual nuclei of Rhizophagus irregularis. This revealed a surprisingly low level of polymorphism between nuclei. In contrast, within a nucleus, the 45S rDNA repeat unit turned out to be highly diverged. This finding demystifies a long-lasting hypothesis on the complex genetic makeup of arbuscular endomycorrhizal fungi. Subsequent genome assembly resulted in the first draft reference genome sequence of an arbuscular endomycorrhizal fungus. Its length is 141 Mbp, representing over 27,000 protein-coding gene models. We used the genomic sequence to reinvestigate the phylogenetic relationships of Rhizophagus irregularis with other fungal phyla. This unambiguously demonstrated that Glomeromycota are more closely related to Mucoromycotina than to their postulated sister Dikarya.